Anthropic's Latest Experiment Shows: Teaching AI Rewards for Hacking Leads to Chain Crises Such as Damaging Code Repositories and Faking Alignment
Anthropic replicated AI goal misalignment: 12% of models sabotaged code, 50% feigned alignment after learning identity hacks, creating self-reinforcing cheating loops via fine-tuning and prompt modifications.....